在實際的資料分析中,通常需同時考量多個自變項對依變項的影響,自變項的資料型態可以是連續型的或類別型的,藉此建立較符合實際情況的羅吉斯迴歸模式,當自變數超過一個以上,則稱為多元羅吉斯迴歸分析(multivariate logistic regression),又可稱為多元或複羅吉斯迴歸分析。
針對河盲症研究何種因子與感染風險有關。
A data frame with 1302 observations on the following 7 variables. (individual data)
| 變數名稱 | 變數涵義 | 資料編碼方式 | 資料型態 |
|---|---|---|---|
| id | subject unique id number | 類別 | |
| mf | Microfiliarial infection | 0 = no; 1 = yes | 類別 |
| area | Area of residence | 0 = Savannah; 1 = Rainforest |
類別 |
| agegrp | age group | 0 = 5-9; 1 = 10-19; 2 = 20-39; 3 = ≥40 | 類別 |
| sex | gender | 0 = male; 1 = female | 類別 |
| mfload | No. of mf in skin snip | 0 = none; 1 = 1-9; 2 = 10-49; 3 = ≥50 | 類別 |
| lesions | Presence of any eye lesions | 0 = no; 1 = yes | 類別 |
Part 1 建立多元羅吉斯迴歸模型
探討區域對於微絲蟲感染之影響?
探討年齡對於微絲蟲感染之影響?
探討性別對於微絲蟲感染之影響?
在調整區域性與年齡層後,微絲蟲感染與性別之關係?
Part 2 ROC 曲線
# 匯入資料
# getwd() #查看目前所在資料夾位置
# setwd('PATH/') # win: 'D:/' or 'C:/'; mac: '~/Desktop/'
dat <- read.csv('Oncho_ems.csv', header = TRUE, stringsAsFactors = TRUE)
# 查看前六筆資料
head(dat)
## id mf area agegrp sex mfload lesions
## 1 1 1 0 2 1 1 0
## 2 2 1 1 3 0 3 0
## 3 3 1 0 3 1 1 0
## 4 4 0 1 2 1 0 0
## 5 5 0 0 3 1 0 0
## 6 6 0 1 2 1 0 0
# 查看資料大小
dim(dat)
## [1] 1302 7
這個資料有7欄,1302橫列。
# 查看變數資料型態
str(dat)
## 'data.frame': 1302 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ mf : int 1 1 1 0 0 0 1 1 1 1 ...
## $ area : int 0 1 0 1 0 1 0 1 1 1 ...
## $ agegrp : int 2 3 3 2 3 2 3 2 3 3 ...
## $ sex : int 1 0 1 1 1 1 0 1 0 1 ...
## $ mfload : int 1 3 1 0 0 0 1 1 1 3 ...
## $ lesions: int 0 0 0 0 0 0 0 0 0 1 ...
由編碼簿可以知道所有變數應該都要是類別型。
# 轉換變數型態
dat$id = as.factor(dat$id)
dat$mf = as.factor(dat$mf)
dat$area = as.factor(dat$area)
dat$agegrp = as.factor(dat$agegrp)
dat$sex = as.factor(dat$sex)
dat$mfload = as.factor(dat$mfload)
dat$lesions = as.factor(dat$lesions)
# 再次查看變數資料型態
str(dat)
## 'data.frame': 1302 obs. of 7 variables:
## $ id : Factor w/ 1302 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ mf : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 2 2 2 ...
## $ area : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
## $ agegrp : Factor w/ 4 levels "0","1","2","3": 3 4 4 3 4 3 4 3 4 4 ...
## $ sex : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1 2 1 2 ...
## $ mfload : Factor w/ 4 levels "0","1","2","3": 2 4 2 1 1 1 2 2 2 4 ...
## $ lesions: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
lm3 = glm(mf ~ area, data = dat, family = binomial(link = "logit"))
summary(lm3)
##
## Call:
## glm(formula = mf ~ area, family = binomial(link = "logit"), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5900 -1.1992 0.8148 0.8148 1.1558
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.05111 0.08546 0.598 0.55
## area1 0.88102 0.11767 7.487 7.05e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1714.1 on 1301 degrees of freedom
## Residual deviance: 1657.0 on 1300 degrees of freedom
## AIC: 1661
##
## Number of Fisher Scoring iterations: 4
# Exponentiated coefficients (odds ratios)
exp(coef(lm3))
## (Intercept) area1
## 1.052434 2.413363
# 2.413363是area的勝算比(odds ratio)
根據summary(lm3),可以將模式寫成以下方程式:
\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = 0.051 + 0.881area\)
上列迴歸方程式可解釋為住在Rainforest地區的人被微絲蟲感染的勝算(odds)是住在Savannah地區的2.413(\(e^{0.881}\))倍,因此可以知道住在Rainforest地區的人會比住在Savannah地區更容易被感染。
(area: Area of residence (0 savannah, 1 forest))
\(7.05 \times 10^{-14}\)為檢定\(H_0: \beta_{area}=0\)之\(p\)值,p-value = \(7.05 \times 10^{-14} < 0.05\),故在此模型中area的影響在統計上是顯著的。
lm4 = glm(mf ~ agegrp, data = dat, family = binomial(link = "logit"))
summary(lm4)
##
## Call:
## glm(formula = mf ~ agegrp, family = binomial(link = "logit"),
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8681 -0.7189 0.6196 0.8358 1.7203
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2212 0.1678 -7.279 3.37e-13 ***
## agegrp1 1.0372 0.2160 4.802 1.57e-06 ***
## agegrp2 2.0933 0.1987 10.534 < 2e-16 ***
## agegrp3 2.7741 0.2081 13.332 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1714.1 on 1301 degrees of freedom
## Residual deviance: 1455.7 on 1298 degrees of freedom
## AIC: 1463.7
##
## Number of Fisher Scoring iterations: 4
# Exponentiated coefficients (odds ratios)
exp(coef(lm4))
## (Intercept) agegrp1 agegrp2 agegrp3
## 0.2948718 2.8213372 8.1120000 16.0239130
根據summary(lm4),可以將模式寫成以下方程式:
\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.221 + 1.037agegrp1 + 2.093agegrp2 + 2.774agegrp3\)
上列迴歸方程式可解釋為年齡層在10-19歲的人被微絲蟲感染的勝算是年齡層在5-9歲的2.821(\(e^{1.037}\))倍; 年齡層在20-39歲的人被微絲蟲感染的勝算是年齡層在5-9歲的8.112(\(e^{2.093}\))倍; 年齡層在>=40歲的人被微絲蟲感染的勝算是年齡層在5-9歲的16.024(\(e^{2.774}\))倍。 因此可以知道年紀越大的人越容易被感染。
(agegrp: Age group (0: 5-9, 1: 10-19, 2: 20-39, 3: >=40))
另外,可以藉由log likelihood ratio test執行以下檢定:
\(H_{0}: \beta_{agegrp1} = \beta_{agegrp2} = \beta_{agegrp3} = 0\)
\(H_{A}: \beta_{agegrp1} \neq 0 \text{ or } \beta_{agegrp2} \neq 0 \text{ or } \beta_{agegrp3} \neq 0\)
(註:也就是檢定在這個模型裡,agegrp變數整體而言是否為顯著的變數)
library(car)
## Loading required package: carData
Anova(lm4, test = "LR", type = "II")
## Analysis of Deviance Table (Type II tests)
##
## Response: mf
## LR Chisq Df Pr(>Chisq)
## agegrp 258.4 3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
G=258.4,p-value(< 2.2e-16) <0.05,因此agegrp這個變數在此模型裡是個顯著的變數。
lm6 = glm(mf ~ sex, data = dat, family = binomial(link = "logit"))
summary(lm6)
##
## Call:
## glm(formula = mf ~ sex, family = binomial(link = "logit"), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5338 -1.3122 0.8589 1.0483 1.0483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.80742 0.08724 9.255 < 2e-16 ***
## sex1 -0.49588 0.11655 -4.255 2.09e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1714.1 on 1301 degrees of freedom
## Residual deviance: 1695.7 on 1300 degrees of freedom
## AIC: 1699.7
##
## Number of Fisher Scoring iterations: 4
# Exponentiated coefficients (odds ratios)
exp(coef(lm6))
## (Intercept) sex1
## 2.2421053 0.6090335
根據summary(lm6),可以將模式寫成以下方程式:
\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = 0.807 - 0.496sex\)
上列迴歸方程式可解釋為女生被感染的勝算是男生的0.609(\(e^{-0.496}\))倍,因此可以知道男生比女生更容易被微絲蟲感染。
(sex: Sex (0: male, 1: female))
\(2.09 \times 10^{-5}\)為檢定\(H_0: \beta_{sex}=0\)之\(p\)值,p-value < 0.05,故在此模型中sex的影響是顯著的。
lm7 = glm(mf ~ sex + agegrp + area, data = dat, family = binomial(link = "logit"))
summary(lm7)
##
## Call:
## glm(formula = mf ~ sex + agegrp + area, family = binomial(link = "logit"),
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2212 -0.7666 0.5392 0.7081 2.1459
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6157 0.2066 -7.820 5.30e-15 ***
## sex1 -0.5813 0.1356 -4.286 1.82e-05 ***
## agegrp1 0.9429 0.2239 4.211 2.54e-05 ***
## agegrp2 2.3478 0.2108 11.138 < 2e-16 ***
## agegrp3 2.8713 0.2171 13.225 < 2e-16 ***
## area1 1.1227 0.1386 8.099 5.52e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1714.1 on 1301 degrees of freedom
## Residual deviance: 1366.1 on 1296 degrees of freedom
## AIC: 1378.1
##
## Number of Fisher Scoring iterations: 4
# Exponentiated coefficients (odds ratios)
exp(coef(lm7))
## (Intercept) sex1 agegrp1 agegrp2 agegrp3 area1
## 0.1987542 0.5591696 2.5673568 10.4623748 17.6593452 3.0731378
根據summary(lm7),可以將模式寫成以下方程式:
\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.616 - 0.581sex + 0.943agegrp1 + 2.348agegrp2 + 2.871agegrp3 + 1.123area\)
上列迴歸方程式可解釋為住在同一地區且年齡層相同的人,女性被感染的勝算是男性的0.559(\(e^{-0.581}\))倍;因此可以知道對於住在同一地區且年齡層相同的人而言,男性比女性更容易被微絲蟲感染。
lm7為例# install.packages("pROC")
library(pROC)
## Warning: package 'pROC' was built under R version 4.0.5
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
Prob(被微絲蟲感染)的預測值(pre)dat$pre = predict(lm7, dat, type = "response")
head(dat)
## id mf area agegrp sex mfload lesions pre
## 1 1 1 0 2 1 1 0 0.5376278
## 2 2 1 1 3 0 3 0 0.9151558
## 3 3 1 0 3 1 1 0 0.6624600
## 4 4 0 1 2 1 0 0 0.7813405
## 5 5 0 0 3 1 0 0 0.6624600
## 6 6 0 1 2 1 0 0 0.7813405
這是剛剛在part 1估計出來的lm7模式:
\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.616 - 0.581sex + 0.943agegrp1 + 2.348agegrp2 + 2.871agegrp3 + 1.123area\)
代入數值即可求得pre,例如:
id 1 : \(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.616 - 0.581 \times 1 + 0.943 \times 0 + 2.348 \times 1 + 2.871 \times 0 + 1.123 \times 0 = 0.151\)
\(\hat{p}_{被微絲蟲感染} = \frac{1}{1 + e^{-0.151}} = 0.5376784\)
id 2 : \(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.616 - 0.581 \times 0 + 0.943 \times 0 + 2.348 \times 0 + 2.871 \times 1 + 1.123 \times 1 = 2.378\)
\(\hat{p}_{被微絲蟲感染} = \frac{1}{1 + e^{-2.378}} = 0.9151\)
rocobj = roc(dat$mf, dat$pre) # roc(實際觀察到的結果, 藉由模式(lm7)預測出來的結果)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(rocobj,
print.thres = TRUE, # print.thres = TRUE: the threshold with the highest sum sensitivity + specificity is plotted
grid = TRUE)
rocobj$auc
## Area under the curve: 0.7924
ci.auc(rocobj)
## 95% CI: 0.7669-0.8179 (DeLong)
可得0.669為最佳切點,此時的Sensitivity:0.680,Specificity:0.771。
AUC = 0.7924 (0.7669, 0.8179)
(註: 最佳切點的判斷依據是希望 sensitivity + specificity 有最大值)
在不同切點(cutoff value, threshold)下的sensitivity和specificity:
data.frame(threshold = rocobj$thresholds,
sensitivity = rocobj$sensitivities,
specificity = rocobj$specificities)
## threshold sensitivity specificity
## 1 -Inf 1.0000000 0.00000000
## 2 0.1329109 0.9939173 0.08541667
## 3 0.1938949 0.9805353 0.16041667
## 4 0.2382886 0.9708029 0.23125000
## 5 0.2962281 0.9513382 0.32708333
## 6 0.3585290 0.9343066 0.36041667
## 7 0.4231920 0.9172749 0.42916667
## 8 0.5024109 0.8819951 0.50833333
## 9 0.5741204 0.8004866 0.63958333
## 10 0.6365365 0.7420925 0.70416667
## 11 0.6688628 0.6800487 0.77083333
## 12 0.7267649 0.6119221 0.81666667
## 13 0.7798023 0.5279805 0.86458333
## 14 0.8195604 0.4111922 0.92291667
## 15 0.8612349 0.2603406 0.93958333
## 16 0.8899226 0.1630170 0.96458333
## 17 Inf 0.0000000 1.00000000
可以利用曲線下的面積(AUC)來判別ROC曲線的鑑別力,AUC 數值的範圍從0到1,數值愈大愈好。以下為AUC 數值一般的判別規則:
AUC=0.5 (無鑑別力)
0.7≦AUC≦0.8 (可接受的鑑別力)
0.8≦AUC≦0.9 (優良的鑑別力)
0.9≦AUC≦1.0 (極佳的鑑別力)
[觀察值(data) → 建立logistic regression model → 預測值pre → cutoff point → sensitivity & specificity → ROC曲線]
當cutoff point = 0.669時,如果Prob(被微絲蟲感染)的預測值pre>0.669,這個人就會被判定成「會被感染」,反之,若pre<0.669,則判定為「不會被感染」。
out = ifelse(dat$pre > 0.669, 1, 0)
接著,把實際上觀察到的結果dat$mf,和剛剛透過切點(0.669)判斷出來的結果out,計算各種可能組合的次數,可得下面這個列連表:
addmargins(table(mf = dat$mf, predict = out))
## predict
## mf 0 1 Sum
## 0 370 110 480
## 1 263 559 822
## Sum 633 669 1302
這時候就可以算sensitivity和specificity了!
sensitivity = 559/822 = 0.6800487
specificity = 370/480 = 0.7708333
針對前列腺癌研究何種指標能夠有效進行疾病預測。
A data frame with 380 observations on the following 9 variables.
| 變數名稱 | 變數涵義 | 資料編碼方式 | 資料型態 |
|---|---|---|---|
| ID | Identification code | 類別 | |
| CAPSULE | Tumor penetration of prostatic capsule | 0 = No penetration; 1 = Penetration |
類別 |
| AGE | Age in years | 連續 | |
| RACE | Race | 1 = White; 2 = Black | 類別 |
| DPROS | Results of the digital rectal exam | 1 = No nodule; 2 = Unilobar nodule (left); 3 = Unilobar nodule (right); 4 = Bilobar nodule (left) |
類別 |
| DCAPS | Detection of capsular involvement in rectal exam | 1 = No; 2 = Yes | 類別 |
| PSA | Prostatitis specific antigen value (mg/ml) | 連續 | |
| VOL | Tumor volume obtained from ultrasound (cm3) | 連續 | |
| GLEASON | Total gleason score | 連續 | |
以DPROS 為解釋變數進行羅吉斯迴歸模式分析。請列出迴歸係數,並說明DPROS對於CAPSULE的影響(計算OR,並以OR 的角度解釋)。請以likelihood ratio test 檢定DPROS 變數是否顯著。
以PSA為解釋變數進行羅吉斯迴歸模式分析。請列出迴歸係數,並說明PSA對於CAPSULE的影響(計算OR,並以OR 的角度解釋)。
以DPROS, PSA與DCAPS為解釋變數進行羅吉斯迴歸模式分析。請列出迴歸係數,並說明各變數對於CAPSULE的影響(計算OR,並以OR 的角度解釋)。請以likelihood ratio test檢定變數是否顯著。請畫出此模型的ROC曲線與計算ROC曲線下的面積,說明其最佳cutoff value 的sensitivity與specificity為何?並解釋此ROC曲線。